Part 1 – getting ready for programmatic OCR
thomas.hegghammer@all-souls.ox.ac.uk
15 September 2025
Today:
Tomorrow:
https://www.youtube.com/watch?v=-VEJUzDFAZA
One of the oldest and hardest problems in computer science
Major progress from late 2010s onwards thanks to convolutional neural networks
Mostly solved for English, but not for Arabic
Adobe Acrobat Online:
onlineocr.net:
E.g. convertio.co:
Minus: Windows only, and $16+/month
Linux: /home/thomas/Pictures/photo.jpg
Windows: C:\Users\Thomas\Pictures\photo.jpg
pandas in Python or stringr in R
pytesseract in Python and tesseract in R
curl
google-cloud-documentai, daiR
Go to webpage of repository
Locate the green “Code” button and click it
Copy the clone URL
Navigate to what you want to be the parent folder of the cloned repository
Clone
git clone CLONE_URL
# To get the example files etc:
git clone https://github.com/Hegghammer/ocr-cairo
To pull updates from a repository you have already cloned, simply run:
git pull
Assignment: <- in R, = in Python
Indexing: square brackets
Dataframes: pandas, not natively

# Create object
my_text = "Here is some text"
# Create lists
texts = ["Some text", "Some more text", "Even more"]
n_words = [2, 3, 2]
# Create dataframe
import pandas as pd
df = pd.DataFrame({
    "Text": texts,
    "Length": n_words
})
# Access second element of list
texts[1]
# Access third row in Length column
df["Length"].iloc[2]

File management in R:
Working directory: getwd()
List contents: list.files()
Create: file.create(), dir.create()
Rename/move: file.rename()
Delete: file.remove(), unlink()

# Create a directory and some files
dir.create("test")
file.create(c("test/notes1.txt", "test/notes2.txt"))
# Get content
contents <- list.files("test")
# Get full paths of content
contents_full <- list.files("test", full.names = TRUE)
# NB: Output of list.files is vector of paths,
# not files themselves

File management in Python: the os module
List contents: os.listdir() or glob.glob()
Create files: open(FILE, "w")
Create folders: os.makedirs()
Move: shutil module and move()
Delete: os.remove(), os.rmdir()

import os
# Create a directory and some files
os.makedirs("test", exist_ok=True)
open("test/notes1.txt", "w").close()
open("test/notes2.txt", "w").close()
# Get content
contents = os.listdir("test")
# Get full paths
import glob
contents_full = glob.glob("test/*")
# Move file
import shutil
shutil.move("test/notes1.txt", ".")

For loops in R:
for (SEQUENCE_DEFINITION) {
INSTRUCTION
}
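The same looping pattern in Python uses in, and enumerate() when you need the index as well; a minimal sketch with a toy list:

```python
# A toy list to loop over
texts = ["Some text", "Some more text", "Even more"]

# Basic loop: "in" walks through the elements directly
n_words = []
for text in texts:
    n_words.append(len(text.split()))

# enumerate() yields (index, value) pairs
indexed = []
for i, text in enumerate(texts):
    indexed.append((i, text))

print(n_words)  # [2, 3, 2]
```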
For loops in Python use in; enumerate() to get both index & value

Functions in R:
FUNCTION_NAME <- function() {
INSTRUCTION
}
# Basic function
say_hello <- function() {
print("hello!")
}
say_hello()
# One parameter
greet <- function(name) {
greeting <- paste0("Hello, ", name, "!")
print(greeting)
}
greet("Rami")
# Two parameters
greet_n <- function(name, n_times) {
greeting <- paste0("Hello, ", name, "!")
rep(greeting, n_times)
}
greet_n("Nada", 5)

Functions in Python: def and a colon

Reading and writing text in R:
readr::read_file()
write()
stringr functions

library(lorem)
library(readr)
library(stringr)
library(tokenizers)
# Create random content
text <- as.character(ipsum(3))
# Save to file
write(text, "sample.txt")
# Load from file
same_text <- read_file("sample.txt")
# Do things with it
count_words(same_text)
smileys <- str_replace_all(same_text, "\\.", " 🙂")
write(smileys, "smileys.txt")

In Python: re lets you substitute and find things

import lorem
import re
# Create random content
text = lorem.paragraph() * 3
# Save to file
with open("sample.txt", "w") as f:
    f.write(text)
# Load from file
with open("sample.txt", "r") as f:
    same_text = f.read()
# Do things with it
word_count = len(same_text.split())
smileys = re.sub(r"\.", " 🙂", same_text)
with open("smileys.txt", "w", encoding="utf-8") as f:
    f.write(smileys)

Images in R: the magick package is key, lets us load, convert, crop, and save

library(magick)
# Load an example jpeg
files <- list.files("example_docs/columns/orig",
full.names = TRUE
)
img <- image_read(files[1])
image_info(img)
# Get specific page of a PDF
img2 <- image_read_pdf(files[2], pages = 1)
# Make greyscale
img2_grey <- image_convert(img2, type = "Grayscale")
# crop a 200x150 region starting at (50, 100)
img2_crop <- image_crop(img2, "200x150+50+100")
# Save (use this to convert)
image_write(img2_crop, "test.png", format = "png")

import glob
from PIL import Image
# Load an example jpeg
files = glob.glob("example_docs/columns/orig/*")
img = Image.open(files[0])
# Make greyscale
img_grey = img.convert("L")
# Crop a 200x150 region starting at (50, 100)
# PIL crop uses coordinates
# (left, top, right, bottom)
img_crop = img.crop((50, 100, 250, 250))
# Save (use this to convert)
img_crop.save("test.png", format="PNG")

The jiwer command line program, accessed in R via system():
jiwer -r REFERENCE -h HYPOTHESIS
## WER
command <- "jiwer -g -r sample.txt -h smileys.txt"
wer <- as.numeric(system(command, intern = TRUE))
wer
# Inspect
system("jiwer -g -a -r sample.txt -h smileys.txt")
# CER (add -c)
command <- "jiwer -g -c -r sample.txt -h smileys.txt"
cer <- as.numeric(system(command, intern = TRUE))
cer
# Inspect
system("jiwer -g -a -c -r sample.txt -h smileys.txt")

In Python: the jiwer package

Two very different problems
Most OCR engines try to do both, but often fail on complex layouts
Complex layouts often require “the cutout approach”:
Two main strategies: 1) Straight OCR; 2) The cutout approach
Currently, regular engines perform slightly better than VLLMs, but this may well change